This paper is concerned with estimating the intersection point of 
two densities, given a sample of both of the densities. This problem 
arises in classification theory. The main results provide lower bounds 
for the probability of the estimation errors to be large on a scale de- 
termined by the inverse cube root of the sample size. As corollaries, 
we obtain probabilistic bounds for the prediction error in a classifica- 
tion problem. The key to the proof is an entropy estimate. The lower 
bounds are based on bounds for general estimators, which are ap- 
plicable in other contexts as well. Furthermore, we introduce a class 
of optimal estimators whose errors asymptotically meet the border 
permitted by the lower bounds. 

1. Introduction. 

1.1. Motivation and origin of the problem. In this paper we derive lower
bounds for the probability that large errors of some estimators occur. Let
$\mathcal P$ be a class of probability measures on a measurable space
$(\Omega,\mathcal A)$, and let $a:\mathcal P\to\mathbb R$ be a parameter.
Consider an i.i.d. random sample $Z_1,\ldots,Z_n$ from $P\in\mathcal P$ and
an estimator $\hat a_n(Z_1,\ldots,Z_n)$, $\hat a_n:\Omega^n\to\mathbb R$,
for $a$. We are interested in the asymptotic behavior of $\hat a_n$ as
$n\to\infty$.

In the theory of empirical processes one usually considers a deterministic 
loss function whose minimizer over a particular class is equal to or close to 
the parameter. Under some technical assumptions, if the loss is differentiable 
with respect to the parameter, the empirical risk minimizers converge to the 
parameter with the rate $|\hat a_n - a(P)| = O_P(n^{-1/2})$ as the sample
size $n$ grows to $\infty$; see van de Geer [19] and van der Vaart and
Wellner [20].

Kim and Pollard [9] establish a new functional central limit theorem for
empirical processes. They describe an interesting class of asymptotic
problems where the estimators converge at a rate different from
$O_P(n^{-1/2})$ to limit distributions.

An important noncontinuous loss function, frequently used in the theory 
of classification, is the indicator loss function. Let us first describe a gen- 
eral view of classification (or learning theory). We formulate the simplest 
case, which is a two-class problem. Assume that we have two distributions 
on a result space $\mathcal X$, labeled by $Y = 1$ and $Y = -1$. The values
of $Y$ are called "labels" or "natures." Take an observation $X$ from a
mixture of the two distributions. It is sometimes called a "feature." The
problem is to predict the unknown nature $Y$ of a feature $X$. Suppose we
have $n$ i.i.d. copies $(X_i,Y_i)$, $i = 1,\ldots,n$, of a realization
$(X,Y)$, having an unknown probability distribution $P$. A classifier $h$
is a measurable function $h:\mathcal X\to\{\pm1\}$. (Here, we do not
consider more general $[-1,1]$-valued classifiers.) The realization $(X,Y)$
is called misclassified by the classifier $h$ if $h(X)\ne Y$. We take the
deterministic loss function $(x,y)\mapsto 1\{h(x)\ne y\}$. For
$\mathcal X\subseteq\mathbb R$ and features $X$ with a continuous
distribution (at least close to a point), Mohammadi and van de Geer [13]
apply this setup to the case where the classifier $h$ is varied over the
class $\mathcal H = (h_a)_{a\in\mathbb R}$, where $h_a(x) := 1$ for
$x\ge a$ and $h_a(x) := -1$ for $x < a$. Let



(1.1)  $f_P(x,y) = f_P^+(x)1\{y=1\} + f_P^-(x)1\{y=-1\},\qquad (x,y)\in\Omega = \mathcal X\times\{\pm1\},$

denote the joint density of $(X,Y)$, that is, the density of $P$ with
respect to some reference measure $\lambda\otimes$(counting measure).

Let us assume that there is a unique point $a(P)$ at which $f_P^+ - f_P^-$
changes its sign, and that this sign change is from "$-$" to "$+$." Then,
to minimize the risk $P[h(X)\ne Y]$, it suffices to restrict the classifier
$h$ to the class $\mathcal H$, since one has in this case

(1.2)  $\inf_{\text{classifiers } h} P[h(X)\ne Y] = \inf_{a\in\mathbb R} P[h_a(X)\ne Y] = P[h_{a(P)}(X)\ne Y].$

The Bayes rule in this case corresponds to the threshold
$a(P) = \arg\min_{a\in\mathbb R} L_P(a)$, where
$L_P(a) := P(h_a(X)\ne Y)$ denotes the prediction error. A natural choice
for an estimator of $a$ is the threshold
$\hat a_n = \arg\min_{a\in\mathbb R} P_n[h_a(X)\ne Y]$ that minimizes the
classification error in the sample, where
$P_n := \sum_{i=1}^n \delta_{(X_i,Y_i)}/n$ denotes the empirical
distribution of the sample. Strictly speaking, here the "arg min" is not
unique, but one may take any (measurable) choice. In the theory of
classification this is called empirical risk minimization. Mohammadi and
van de Geer [13] invoke the theory of Kim and Pollard [9] to get the rate
$O_P(n^{-1/3})$ of $\hat a_n$ under some conditions. Under monotonicity
assumptions, it is shown that $\hat a_n$ is a nonparametric maximum
likelihood estimator, and that $n^{1/3}(\hat a_n - a(P))$ converges in
distribution to a continuous random variable. For more background
information about empirical risk minimization in classification, see, for
instance, [7], [16] and [11].
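
The empirical risk minimization step for threshold classifiers can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of [13]: the function names `empirical_risk` and `erm_threshold` are hypothetical, the classifier convention (label +1 for x >= a) is an assumption of the sketch, and ties in the arg min are broken by taking the smallest minimizing candidate, which is one admissible choice among several.

```python
def empirical_risk(a, xs, ys):
    """Fraction of sample points misclassified by the threshold classifier.

    Assumed convention for this sketch: predict +1 if x >= a, else -1.
    """
    errors = sum(1 for x, y in zip(xs, ys) if (1 if x >= a else -1) != y)
    return errors / len(xs)


def erm_threshold(xs, ys):
    """Return one minimizer of the empirical risk over thresholds a.

    The empirical risk is piecewise constant in a and can only change at
    sample points, so it suffices to scan the distinct sample points plus
    one candidate to the right of all of them (which predicts -1 everywhere).
    """
    candidates = sorted(set(xs)) + [max(xs) + 1.0]
    return min(candidates, key=lambda a: empirical_risk(a, xs, ys))
```

For a perfectly separated toy sample such as `xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]` with labels `[-1, -1, -1, 1, 1, 1]`, every threshold in (0.3, 0.7] has zero empirical risk; the scan reports 0.7 because only sample points are tried as candidates.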

1.2. Statement of the problem and results. In this paper we address the 
following question: Is there any sequence of estimators
$(\hat a_n:\Omega^n\to\mathbb R)_{n\in\mathbb N}$ which converges to $a(P)$
with a rate faster than $O_P(n^{-1/3})$? Under some
assumptions, to be specified below, the answer is no. 

Let us introduce the class $\mathcal V$ of probability measures $P$ that we
consider. We assume that the feature $X$ takes values in the unit interval
$\mathcal X = [0,1]$.

Let $\mathcal P$ denote the set of all probability distributions
$P = f_P\,[\lambda_{[0,1]}\otimes(\text{counting measure})]$ on
$\Omega = [0,1]\times\{\pm1\}$ (with the Borel $\sigma$-field) with
$f_P\in C^1([0,1]\times\{\pm1\})$. Here, $\lambda_{[0,1]}$ denotes the
Lebesgue measure on $[0,1]$. [This particular choice of the reference
measure, at least locally, plays a role in some technical estimates, e.g.,
in the basic entropy bound (4.13) below.]

We endow $\mathcal P$ with the metric $d$, given by

(1.3)  $d(P,Q) := \|f_P - f_Q\|_\infty + \|\partial_1 f_P - \partial_1 f_Q\|_\infty,$

where $\partial_1$ denotes the derivative with respect to the first
argument.

Let $\mathcal V$ denote the set of all $P\in\mathcal P$ such that
$f_P^+ := f_P(\cdot,1)$ and $f_P^- := f_P(\cdot,-1)$ have a unique
intersection point $a(P)$, this intersection point is contained in the
open interval $(0,1)$, and the intersection is transversal with a
specified orientation,

(1.4)  $f_P^+(a(P)) = f_P^-(a(P)),\qquad (f_P^+)'(a(P)) > (f_P^-)'(a(P)).$

We endow $\mathcal V$ with the topology induced by the metric $d$.

For our results it is essential to have at least some control on the derivative 
of fp, which is reflected by the choice (1.3) of the metric d. 

Here, we present the main results of this paper. The first theorem
considers the estimation error on the critical scale
$\mathrm{const}\cdot n^{-1/3}$, uniformly over (small) open subsets of
$\mathcal V$.

Theorem 1.1. Let $\mathcal U\subseteq\mathcal V$ be a nonempty open set.
Then there is $c_1 = c_1(\mathcal U) > 0$, such that for every
$\delta\in(0,1/4]$ and for every sequence of estimators
$\hat a_n:\Omega^n\to\mathbb R$, $n\in\mathbb N$, one has

(1.5)  $\liminf_{n\to\infty}\ \sup_{P\in\mathcal U} P^n[n^{1/3}|\hat a_n - a(P)| > T] \ge \delta,$

where $T = T(\mathcal U,\delta) := c_1|\log(11\delta)|^{1/3}$.

Unlike Theorem 1.1, the following theorem considers the asymptotics of
the estimation error point-wise, that is, it takes the limit as
$n\to\infty$ before taking a supremum over open sets
$\mathcal U\subseteq\mathcal V$. To motivate this order of taking
limits, consider the following game of a statistician against "nature."
Nature chooses just one $P\in\mathcal V$, unknown to the statistician. The
statistician may choose various sample sizes $n$, and she or he obtains a
certain rate of convergence of the estimators as $n\to\infty$ for this
fixed, given $P$. Thus, examining the limit $n\to\infty$ for fixed, but
arbitrary $P$ may contain at least as relevant information as taking
$\liminf_{n\to\infty}\sup_{P\in\mathcal U}$. The following theorem does not
examine the critical scale $n^{-1/3}$; it rather works with a smaller
scale $\ll n^{-1/3}$. The reason for this is explained below.

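Theorem 1.2. Let $(\beta_n)_{n\in\mathbb N}$ be a sequence of positive
numbers with $\lim_{n\to\infty} n^{-1/3}\beta_n = \infty$. Then, for all
nonempty open sets $\mathcal U\subseteq\mathcal V$ and for all sequences of
estimators $\hat a_n:\Omega^n\to\mathbb R$, $n\in\mathbb N$, one has

(1.6)  $\sup_{P\in\mathcal U}\ \limsup_{n\to\infty} P^n[\beta_n|\hat a_n - a(P)| > 1] \ge 1/4.$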

Theorem 1.2 states that, independently of how small our statistical model
$\mathcal U$ is, we always find a particular model $P$ in this class such
that the estimation error for this particular model will be, with positive
probability, asymptotically larger than any given scale smaller than
$n^{-1/3}$. The proof
of this theorem uses Baire's theorem. A related argument, concerning the 
equicontinuity and the consistency of substitution estimators with values in 
a metric space, is presented in [15]. 

In contrast to Theorem 1.1, Theorem 1.2 does not consider the critical
scale $\mathrm{const}\cdot n^{-1/3}$. Indeed, its claim (1.6) breaks down
on this critical scale.
This is the content of the following theorem. 

Theorem 1.3. There is a family of estimators
$(\hat a_{n,L}:\Omega^n\to\mathbb R)_{n\in\mathbb N,\,L>0}$ with the
following property: For all $P\in\mathcal V$, there is a neighborhood
$\mathcal N\subseteq\mathcal V$ of $P$, such that for all $T > 0$ one has

(1.7)  $\lim_{L\to\infty}\ \sup_{Q\in\mathcal N}\ \limsup_{n\to\infty} Q^n[n^{1/3}|\hat a_{n,L} - a(Q)| > T] = 0.$

Such estimators $\hat a_{n,L}$ are explicitly described in Section 5 below.
Speaking very roughly, one estimates the density $f_P$ in a certain
neighborhood of size $Ln^{-1/3}$ of a first approximation of $a(Q)$ using
regression lines.

The following corollaries translate the asymptotic bounds in Theorems 
1.1, 1.2 and 1.3 to bounds for the rate of convergence of the prediction error 
$L_P(\hat a_n)$ to the optimal value $L_P(a(P))$.

Corollary 1.4. Under the conditions of Theorem 1.1, one has

(1.8)  $\liminf_{n\to\infty}\ \sup_{P\in\mathcal U} P^n[L_P(\hat a_n) > L_P(a(P)) + Sn^{-2/3}] \ge \delta,$

where $S = S(\mathcal U,\delta) := c_2|\log(11\delta)|^{2/3}$ with a
constant $c_2 = c_2(\mathcal U) > 0$.

Corollary 1.5. Under the conditions of Theorem 1.2, in particular, for
$\beta_n^{-1} = o(n^{-1/3})$, one has

(1.9)  $\sup_{P\in\mathcal U}\ \limsup_{n\to\infty} P^n[L_P(\hat a_n) > L_P(a(P)) + c_3\beta_n^{-2}] \ge 1/4,$

with a constant $c_3 = c_3(\mathcal U) > 0$.

Corollary 1.6. Take a family of estimators
$(\hat a_{n,L})_{n\in\mathbb N,\,L>0}$ that fulfills the claim of
Theorem 1.3. Then, for all $P\in\mathcal V$, there is a neighborhood
$\mathcal N\subseteq\mathcal V$ of $P$, such that, for all $T > 0$, one has

(1.10)  $\lim_{L\to\infty}\ \sup_{Q\in\mathcal N}\ \limsup_{n\to\infty} Q^n[L_Q(\hat a_{n,L}) > L_Q(a(Q)) + Tn^{-2/3}] = 0.$

1.3. Discussion and comparison to other results. Let us discuss our re- 
sults and compare them with some previous results on lower bounds in 
classification and regression. 

The paper [10] by Mammen and Tsybakov views the classification problem 
as the estimation problem of a set V . The authors consider the case that the 
region V has a smooth boundary or belongs to another nonparametric class 
of sets. They show that the empirical risk minimizers achieve the optimal 
rates for estimation of V and optimal rates of convergence for Bayes risks. 

It is interesting to compare our Theorem 1.1 with Theorem 3 in Mammen 
and Tsybakov's paper [10], in particular, with formula (22) there. The setup 
in the paper [10] is much more general than ours. It differs from the one in 
Theorem 1.1, even when one specializes it to our one-dimensional setup and 
to the special classifiers $h_a$. More specifically, this specialization
yields, for all $p\ge 1$,

(1.11)  $\liminf_{n\to\infty}\ \sup_{P\in\mathcal F_{\mathrm{frag}}} n^{p/3}E_{P^n}[|\hat a_n - a(P)|^p] > 0,$

instead of our claim (1.5), where the class of distributions
$\mathcal F_{\mathrm{frag}}$ specified in the reference is not as small as
our open set $\mathcal U$.

Let us compare the estimators $\hat a_{n,L}$ in Theorem 1.3 and
Corollary 1.6 with the empirical risk minimizers $\hat a_n$, which are
examined in the paper [13] by Mohammadi and van de Geer. For the empirical
risk minimizers $\hat a_n$, one has

(1.12)  $\sup_{Q\in\mathcal N}\ \limsup_{n\to\infty} Q^n[n^{1/3}|\hat a_n - a(Q)| > T] > 0$

for all $T < \infty$; more details are given in Theorem 2.2 in [13]. The
empirical risk minimizers $\hat a_n$ may be well applicable for larger
classes of distributions than $\mathcal V$, where our results may not
apply. Our intention behind Theorem 1.3 is mainly to show that Theorem 1.2
is optimal. However, comparing (1.12) with (1.7), one sees that, for large
$L$, $\hat a_{n,L}$ is an improvement over $\hat a_n$, at least
asymptotically in the limit as $n\to\infty$. Thus, from a practical point
of view, we suggest use of the estimators $\hat a_{n,L}$ instead of
$\hat a_n$ whenever one suspects the regularity conditions imposed in our
model are reasonable in an application at hand.

Roughly speaking, the improvement is obtained by using information 
about the empirical distribution in the neighborhood of the estimator
$\hat a_n$. The scaling parameter $L$ is used to determine the size of this
neighborhood. More specifically, one estimates the unknown densities close
to $\hat a_n$ using regression lines. For a given sample size $n$, it might
not make sense to take $L$ too large to get a good estimator
$\hat a_{n,L}$, due to the order of limits
"$\lim_{L\to\infty}\cdots\limsup_{n\to\infty}$" in (1.7).

Donoho and Liu [6] consider estimating a functional $T(F)$ of an unknown
distribution $F\in\mathbf F$ with some class of distributions $\mathbf F$.
They compute the modulus of continuity $\omega(\varepsilon)$ of $T$ with
respect to Hellinger distance in certain cases. For a well-behaved loss
function $l(t)$, they show that if $T$ is linear and $\mathbf F$ is convex,
then $\inf_{T_n}\sup_{F\in\mathbf F} E_F(l(T_n - T(F)))$ is equivalent to
$l(\omega(n^{-1/2}))$ within constants. The same conclusion is drawn for
three cases of nonlinear functionals: estimating the rate of decay of a
density, estimating the mode and robust nonparametric regression. Our case,
estimating the intersection point of two densities, is a different case.
However, it gets the modulus of continuity
$\omega(\varepsilon) = \varepsilon^{2/3}$ for $l(t) := |t|$ and therefore
$\omega(n^{-1/2}) = n^{-1/3}$, which coincides with the optimal rate.

The general estimates for lower bounds, presented in Section 3 below, 
can also be applied to higher-dimensional problems. This will be shown in 
a forthcoming paper. 

Let us briefly review some further known results which are vaguely related 
to the facts proven in this paper. 

Let $(X,Y),(X_1,Y_1),(X_2,Y_2),\ldots$ be independent identically
distributed $\mathbb R^d\times\mathbb R$-valued random variables with
$E(Y^2) < \infty$. In a regression problem, Stone [17] showed that for a
class of distributions and for a class of regression functions which are
$p$ times continuously differentiable, the optimal lower rate of
convergence is $n^{-2p/(2p+d)}$.

Antos, Györfi and Kohler [2] showed that there exist individual lower 
bounds on the rate of convergence of nonparametric regression estimates 
which are arbitrarily close to Stone's minimax lower bounds. 

In classification Antos and Lugosi [3] showed that for several natural 
concept classes (classes of subsets of $\mathcal X$, the domain of $X$), including the 
class of linear half-spaces, there exist a fixed distribution of X and a fixed 
concept C such that the expected error is larger than a constant times k/n 
for infinitely many n, where k is the number of parameters. They obtained 
strong minimax lower bounds for the tail distribution of the probability of 
error, which extend the corresponding minimax lower bounds. 

Our second form of lower bound, that is, Theorem 1.2, is comparable with
the individual lower rate of convergence in [1]. In the latter, the
individual lower rate of convergence for a class $\mathcal D$ of
distributions of $(X,Y)$ is defined by a sequence $a_n$ which satisfies

(1.13)  $\inf_{g_n}\ \sup_{(X,Y)\in\mathcal D}\ \limsup_{n\to\infty} a_n^{-1}\,|L_P(g_n) - \inf_g L_P(g)| > 0,$

where $g$ is a classifier and $g_n$ is an estimator. A class of
distributions $\mathcal D_\beta$ of $(X,Y)$ is given as the product of one
uniform distribution and a cubic class of regression functions with
parameter $\beta$. Under some assumptions, the individual lower rate of
convergence for $\mathcal D_\beta$ is obtained by
$b_n n^{-2\beta/(2\beta+d)}$. The class $\mathcal D_\beta$ is of course
different from our class $\mathcal V$, but the order of inf, sup and
lim sup in (1.13) is the same as ours in Theorem 1.2.

For more references on lower bounds, see Gill and Levit [8] and Tsybakov 
[18]. 

For a more general nonparametric setup than ours, Pfanzagl showed in
[14] that no limit distribution can be attained with the rate $n^{1/3}$
uniformly on certain shrinking neighborhoods of the sample distribution $P$.

In the paper [4] by Bühlmann and Yu, the $n^{1/3}$-asymptotic appears in
the context of bagging. These authors also use the results of Kim and
Pollard [9].
Using decision trees, problems concerning higher dimensional X are reduced 
to the analysis of a one-dimensional setup. 

Organization of this paper. Let us explain how the rest of this paper 
is organized. In Section 2 we collect some fundamental entropy estimates. 
Section 3 shows universal, general counterparts of Theorems 1.1 and 1.2, 
without assuming the specific form of our model $(\Omega,\mathcal A,\mathcal V)$. We expect these 
lemmas to be useful for other examples too. One key idea is the use of 
Baire's theorem to show that the set of P's with estimation errors being 
asymptotically large on a given scale is of the second Baire category. In 
Section 4 Theorems 1.1 and 1.2 are proven. The proofs are based on a 
bound for relative entropies for slightly perturbed densities, described in 
Lemma 4.1 below. In Section 5 Theorem 1.3 is shown by constructing the 
estimators $\hat a_{n,L}$ in a two-step procedure. Section 6 contains the proofs of 
the corollaries. The key idea for the higher dimensional case is sketched in 
Section 7. 

2. Preliminaries. In this section we review some standard estimates to 
compare probabilities with respect to different measures, based on bounds of 
the relative entropy. Alternatively (and more or less equivalently) , one could 
use bounds for the Hellinger distance instead of the relative entropy, but we 
do not follow this alternative approach here. Here, we need not assume
any specific form of the model $(\Omega,\mathcal A,\mathcal V)$; we take an
arbitrary parameter $a:\mathcal V\to\mathbb R$, and
$(\hat a_n:\Omega^n\to\mathbb R)_{n\in\mathbb N}$ denotes any sequence of
estimators.

Let $H(P,Q) := E_P[\log\frac{dP}{dQ}]$ denote the relative entropy for
$P,Q\in\mathcal V$, whenever it is well defined.

Lemma 2.1. Let $P$ and $Q$ be probability measures with $H(P,Q) < \infty$.
For every random variable $X$ with $0\le X\le 1$, one has

(2.1)  $E_Q[X] \ge e^{-2H(P,Q)-1}\left(E_P[X] - \tfrac12\right).$



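Proof. For $x\ge 0$, set $\psi(x) := x\log x - x + 1$. Note that
$\psi\ge 0$. We set $N := e^{2H(P,Q)+1}\ge e$ and
$A := \{dP/dQ > N\}$. Using $\psi\ge 0$, $\psi(1) = 0$ and the convexity of
$\psi$, one sees that

(2.2)  $\psi(x) \ge \frac{\psi(N)}{N}\,1\{x > N\}\,x$

for all $x\ge 0$, and thus,

(2.3)  $H(P,Q) = E_Q\left[\psi\left(\frac{dP}{dQ}\right)\right] \ge \frac{\psi(N)}{N}E_Q\left[1_A\frac{dP}{dQ}\right] = \frac{\psi(N)}{N}P[A].$

We conclude, using $\psi(N)/N = \log(N/e) + 1/N \ge 2H(P,Q)$,

(2.4)  $E_Q[X] \ge E_Q[X1(A^c)] \ge \frac1N E_P[X1(A^c)] \ge \frac1N\left(E_P[X] - P[A]\right) \ge \frac1N\left(E_P[X] - \frac12\right),$

which is the claim (2.1). □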

The 2 in the exponent of (2.1) could be replaced by any fixed number
larger than 1, if one replaced the $\tfrac12$ in (2.1) by a different
constant. This would only change the constants in our main theorems.
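
As a sanity check, the inequality of Lemma 2.1 can be tested numerically on finite probability spaces. The sketch below is illustrative only: it assumes the reading $E_Q[X] \ge e^{-2H(P,Q)-1}(E_P[X] - 1/2)$ of (2.1), and the helper names (`relative_entropy`, `lemma_2_1_slack`, `smallest_slack`) are hypothetical.

```python
import math
import random


def relative_entropy(p, q):
    """H(P, Q) = sum_i p_i * log(p_i / q_i) for finite distributions with q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def lemma_2_1_slack(p, q, x):
    """Left side minus right side of (2.1); the lemma asserts this is >= 0."""
    h = relative_entropy(p, q)
    e_p = sum(pi * xi for pi, xi in zip(p, x))
    e_q = sum(qi * xi for qi, xi in zip(q, x))
    return e_q - math.exp(-2.0 * h - 1.0) * (e_p - 0.5)


def random_distribution(rng, k):
    """A random probability vector of length k with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def smallest_slack(trials=1000, k=5, seed=0):
    """Smallest slack observed over random (P, Q, X) with 0 <= X <= 1."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        p = random_distribution(rng, k)
        q = random_distribution(rng, k)
        x = [rng.random() for _ in range(k)]
        worst = min(worst, lemma_2_1_slack(p, q, x))
    return worst
```

A random search of this kind cannot prove the lemma; it only guards against a misreading of the constants.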

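Lemma 2.2. Let $\chi:\mathbb R\to[0,1]$ be a measurable function with
$\chi(x) = 1$ for $|x|\le 1$ and $\chi(x) = 0$ for $|x|\ge 2$. Take
$n\in\mathbb N$, $\beta_n > 0$, $P_n,Q_n\in\mathcal V$, and
$\delta\in(0,1/4]$. If

(2.5)  $nH(P_n,Q_n) \le \frac12\log\frac{1}{11\delta},\qquad \beta_n|a(P_n) - a(Q_n)| \ge 4$

hold, then at least one of the two bounds

(2.6)  $E_{P_n^n}[\chi(\beta_n(\hat a_n - a(P_n)))] < 1-\delta$  or  $E_{Q_n^n}[\chi(\beta_n(\hat a_n - a(Q_n)))] < 1-\delta$

is valid.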

Proof. (Indirectly). Assume that both formulas in (2.6) fail to hold.
Using Lemma 2.1, we get

(2.7)  $Q_n^n[|\beta_n(\hat a_n - a(P_n))| < 2] \ge E_{Q_n^n}[\chi(\beta_n(\hat a_n - a(P_n)))]$
       $\ge e^{-2nH(P_n,Q_n)-1}\left[E_{P_n^n}[\chi(\beta_n(\hat a_n - a(P_n)))] - \frac12\right]$
       $\ge e^{-2nH(P_n,Q_n)-1}\left(\frac12 - \delta\right) \ge \frac{e^{-2nH(P_n,Q_n)}}{4e} \ge \frac{11\delta}{4e} > \delta$

by (2.5); recall that $\delta\le 1/4$. Furthermore, we have

(2.8)  $Q_n^n[\beta_n|\hat a_n - a(Q_n)| < 2] \ge E_{Q_n^n}[\chi(\beta_n(\hat a_n - a(Q_n)))] \ge 1-\delta$

by the opposite of the right-hand side of (2.6) and the choice of $\chi$.
Recall that $\chi:\mathbb R\to[0,1]$. The bounds (2.8) and (2.7) imply that
the events $\{\beta_n|\hat a_n - a(Q_n)| < 2\}$ and
$\{\beta_n|\hat a_n - a(P_n)| < 2\}$ have a nonempty intersection. This
implies the contradiction $\beta_n|a(P_n) - a(Q_n)| < 4$. □
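
The numerical constant 11 in (2.5) is what makes the last step of (2.7) work. As a short worked check, assuming the reading of (2.5) as $e^{-2nH(P_n,Q_n)} \ge 11\delta$ and using $\tfrac12 - \delta \ge \tfrac14$ for $\delta \le 1/4$:

```latex
\[
e^{-2nH(P_n,Q_n)-1}\Bigl(\tfrac12-\delta\Bigr)
\;\ge\; \frac{11\delta}{e}\cdot\frac14
\;=\;\frac{11}{4e}\,\delta
\;>\;\delta,
\qquad\text{since } 4e\approx 10.873<11 .
\]
```

Any constant larger than $4e$ in place of 11 would serve the same purpose.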

3. General lower bounds. In this section we prepare the proofs of
Theorems 1.1 and 1.2 by providing some general, abstract lower bounds for
estimators. Since we expect these lemmas to be useful also in contexts
other than the estimation of thresholds, we do not assume $\Omega$,
$\mathcal V$ and $a:\mathcal V\to\mathbb R$ have the specific form
described in Section 1. Rather, in this section $(\Omega,\mathcal A)$ may
be any measurable space, $\mathcal V$ may be any set of probability
measures on $(\Omega,\mathcal A)$, endowed with a topology, and
$a:\mathcal V\to\mathbb R$ may be any parameter; only some very general
assumptions for the topology on $\mathcal V$ and for the parameter $a$ are
required in Lemma 3.2 below. The first lemma in this section is a general
statement which plays an essential role in the proof of Theorem 1.1.

Lemma 3.1. Take a sequence $(\beta_n)_{n\in\mathbb N}$ of positive numbers,
$\delta\in(0,1/4]$, and a nonempty open set $\mathcal U\subseteq\mathcal V$.
For all large $n\in\mathbb N$, assume that there are $P_n,Q_n\in\mathcal U$
such that

(3.1)  $nH(P_n,Q_n) \le \frac12\log\frac{1}{11\delta}$  and  $\beta_n|a(P_n) - a(Q_n)| \ge 4$

hold. Then for all sequences
$(\hat a_n:\Omega^n\to\mathbb R)_{n\in\mathbb N}$ of estimators, one has

(3.2)  $\liminf_{n\to\infty}\ \sup_{P\in\mathcal U} P^n[\beta_n|\hat a_n - a(P)| > 1] \ge \delta.$

Proof. Take $\chi := 1_{[-1,1]}$. By (3.1) and Lemma 2.2, (2.6) holds for
all large $n$. Thus,

(3.3)  $\inf_{P\in\mathcal U} P^n[\beta_n|\hat a_n - a(P)| \le 1] < 1-\delta$

holds for all large $n$; hence, (3.2) is true. □

The constant 4 on the right-hand side in (3.1) is not optimal. However, 
improving it does not improve our main theorems. 

The next lemma provides an abstract key ingredient for the proof of 
Theorem 1.2. 

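Lemma 3.2. Let $a:\mathcal V\to\mathbb R$ be a parameter and
$(\hat a_n:\Omega^n\to\mathbb R)_{n\in\mathbb N}$ be a sequence of
estimators. Let $\mathcal V$ be endowed with a Baire space metrizable
topology. Assume that $a:\mathcal V\to\mathbb R$ is continuous.
Furthermore, assume that for all $P\in\mathcal V$, the total variation
distance from $P$, that is,
$Q\mapsto\|P - Q\|_{\mathcal A} = \sup_{A\in\mathcal A}|P[A] - Q[A]|$,
$Q\in\mathcal V$, is continuous too. Let $(\beta_n)_{n\in\mathbb N}$ be a
sequence of positive numbers, and take $\delta\in(0,1/4]$.

Suppose that for all $P\in\mathcal V$, for all neighborhoods $\mathcal N$
of $P$, and for all $m\in\mathbb N$, there are $n\ge m$ and
$Q_n\in\mathcal N$ such that

(3.4)  $\beta_n|a(P) - a(Q_n)| \ge 4$  and  $nH(P,Q_n) \le \frac12|\log(11\delta)|$

are valid. Then for all nonempty open sets
$\mathcal U\subseteq\mathcal V$, one has

(3.5)  $\sup_{P\in\mathcal U}\ \limsup_{n\to\infty} P^n[\beta_n|\hat a_n - a(P)| > 1] \ge \delta.$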

Proof. Let $\chi:\mathbb R\to[0,1]$ be a continuous function with
$\chi(x) = 1$ for $|x|\le 1$, and $\chi(x) = 0$ for $|x|\ge 2$. For
$m,n\in\mathbb N$, we set

(3.6)  $\mathcal V_n := \{P\in\mathcal V : E_{P^n}[\chi(\beta_n(\hat a_n - a(P)))] \ge 1-\delta\},\qquad \mathcal F_m := \bigcap_{n\ge m}\mathcal V_n.$

We claim that the map

(3.7)  $\mathcal V\to[0,1],\qquad P\mapsto E_{P^n}[\chi(\beta_n(\hat a_n - a(P)))]$

is continuous. To prove this claim, let $P\in\mathcal V$, and consider a
sequence $(Q_k)_k$ in $\mathcal V$ converging to $P$. Then for all
$\omega\in\Omega^n$, we have

(3.8)  $\chi(\beta_n(\hat a_n(\omega) - a(Q_k))) \to \chi(\beta_n(\hat a_n(\omega) - a(P)))$  as $k\to\infty$

by the continuity of $a$ and of $\chi$. Using Lebesgue's dominated
convergence theorem, this implies

(3.9)  $E_{P^n}[\chi(\beta_n(\hat a_n - a(Q_k)))] \to E_{P^n}[\chi(\beta_n(\hat a_n - a(P)))]$  as $k\to\infty$;

recall that $\chi$ takes values in the unit interval. Furthermore,

(3.10)  $|E_{Q_k^n}[\chi(\beta_n(\hat a_n - a(Q_k)))] - E_{P^n}[\chi(\beta_n(\hat a_n - a(Q_k)))]| \le n\|Q_k - P\|_{\mathcal A} \to 0$  as $k\to\infty$,

since, by our hypothesis, the total variation distance from $P$ is
continuous. Combining (3.9) and (3.10), we get

(3.11)  $E_{Q_k^n}[\chi(\beta_n(\hat a_n - a(Q_k)))] \to E_{P^n}[\chi(\beta_n(\hat a_n - a(P)))]$  as $k\to\infty$,

which shows that $E_{P^n}[\chi(\beta_n(\hat a_n - a(P)))]$ depends
continuously on $P$. Note that in the last step we used the fact that the
chosen topology on $\mathcal V$ is metrizable (or, at least, that
sequential continuity on $\mathcal V$ implies continuity).

The continuity of the map described in (3.7) implies that the sets
$\mathcal V_n\subseteq\mathcal V$ are closed; thus, their intersections
$\mathcal F_m$ are closed too.

Next, we show that the sets $\mathcal F_m\subseteq\mathcal V$,
$m\in\mathbb N$, are nowhere dense. To check this, take
$P\in\mathcal F_m$ and a neighborhood $\mathcal N$ of $P$ in $\mathcal V$.
By the hypothesis of the lemma, there exist $n\ge m$ and
$Q_n\in\mathcal N$ such that (3.4) holds. Then Lemma 2.2 implies

(3.12)  $E_{P^n}[\chi(\beta_n(\hat a_n - a(P)))] < 1-\delta$  or  $E_{Q_n^n}[\chi(\beta_n(\hat a_n - a(Q_n)))] < 1-\delta,$

that is, $P\notin\mathcal V_n\supseteq\mathcal F_m$ or
$Q_n\notin\mathcal V_n\supseteq\mathcal F_m$, and thus,
$\mathcal N\not\subseteq\mathcal F_m$. This shows that indeed
$\mathcal F_m$ is nowhere dense.

Let $\mathcal U\subseteq\mathcal V$ be a nonempty open set. Since
$\mathcal V$ is endowed with a Baire space topology, we conclude that
$\mathcal U$ is not contained in $\bigcup_{m\in\mathbb N}\mathcal F_m$; so
we can take $P\in\mathcal U\setminus\bigcup_{m\in\mathbb N}\mathcal F_m$.
For this $P$, we know that $P\notin\mathcal V_n$ for infinitely many
$n\in\mathbb N$. Thus, we get

(3.13)  $\liminf_{n\to\infty} E_{P^n}[\chi(\beta_n(\hat a_n - a(P)))] \le 1-\delta.$

Using

(3.14)  $P^n[\beta_n|\hat a_n - a(P)| > 1] \ge 1 - E_{P^n}[\chi(\beta_n(\hat a_n - a(P)))],$

this implies

(3.15)  $\limsup_{n\to\infty} P^n[\beta_n|\hat a_n - a(P)| > 1] \ge \delta,$

and thus, the claim (3.5) follows. □

4. Lower bounds for errors in threshold estimation. In this section we 
prove Theorems 1.1 and 1.2. Thus, let $\mathcal V$ denote again the
concrete space of probability measures defined in Section 1, endowed with
the metric $d$, defined in (1.3). Note that $(\mathcal P,d)$ is a complete
metric space. We claim that $a:\mathcal V\to[0,1]$ is a continuous
parameter. Indeed, this is a consequence of the implicit function theorem
applied to the map
$F:(0,1)\times C^1([0,1]\times\{\pm1\})\to\mathbb R$,
$F(x,f) := f(x,1) - f(x,-1)$. Theorem 10.2.1 in [5] presents a version of
the implicit function theorem applicable to our situation. The map $F$ is
continuously differentiable with the derivative

(4.1)  $DF(x,f):(\Delta x,\Delta f)\mapsto[D_1f(x,1) - D_1f(x,-1)]\Delta x + \Delta f(x,1) - \Delta f(x,-1);$

thus, the implicit function theorem is applicable for any point $(x,f)$ for
which the transversality condition $D_1f(x,1)\ne D_1f(x,-1)$ holds. It
yields the continuity of the function $a:\mathcal V\to(0,1)$, implicitly
defined by the equation $F(a(P),f_P) = 0$. Furthermore, $\mathcal V$ is an
open subset of the space $\mathcal P$. In particular, by Baire's category
theorem, $\mathcal V$ is a Baire space.

The following lemma contains the basic entropy estimate for perturbed 
densities. Here is the idea. A given probability density $f_P$ is slightly
modified by a perturbation of order $O(\varepsilon)$ in a neighborhood of
size $O(\varepsilon)$ of the transversal intersection point $a(P)$. Let $Q$
denote the probability measure corresponding to the modified density $f_Q$;
then we show that the entropy $H(P,Q)$ has roughly the order
$O(\varepsilon^3)$, but the parameter $a(Q)$ deviates from $a(P)$ on a
scale of order $\varepsilon$. The cube of $\varepsilon$ arising in the
entropy bound is the key to derive the cube root asymptotic lower bounds in
this paper.
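
The two scalings in this idea (entropy of order $\varepsilon^3$, shift of the intersection point of order $\varepsilon$) can be observed numerically. The following Python sketch is only an illustration under stated assumptions: the densities $f^+(x) = x$ and $f^-(x) = 1 - x$ on $[0,1]$, the bump `phi`, and the particular way mass is moved between the two labels (leaving $f^+ + f^-$ unchanged) are choices made for this example, and all function names are hypothetical.

```python
import math

A_P = 0.5  # intersection point of f_plus(x) = x and f_minus(x) = 1 - x


def phi(u):
    """C^1 bump with phi(0) = 1 and 0 <= phi <= 1, supported on [-1, 1]."""
    return (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0


def xi(x, eps):
    """Perturbation of height eps and width eps around the intersection point."""
    return eps * phi((x - A_P) / eps)


def perturbed_densities(x, eps):
    """Densities of Q: mass moves between the labels, the sum stays unchanged."""
    f_plus, f_minus = x, 1.0 - x
    rho_plus, rho_minus = x, 1.0 - x  # f_plus + f_minus = 1 in this example
    g_plus = (1.0 + xi(x, eps) * rho_minus) * f_plus
    g_minus = (1.0 - xi(x, eps) * rho_plus) * f_minus
    return g_plus, g_minus


def relative_entropy(eps, steps=20000):
    """Midpoint-rule approximation of H(P, Q); the integrand vanishes off the bump."""
    lo, hi = A_P - eps, A_P + eps
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        g_plus, g_minus = perturbed_densities(x, eps)
        total += (x * math.log(x / g_plus)
                  + (1.0 - x) * math.log((1.0 - x) / g_minus)) * dx
    return total


def shifted_intersection(eps, iters=60):
    """Bisection for the zero of g_plus - g_minus in [A_P - eps, A_P]."""
    lo, hi = A_P - eps, A_P
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        g_plus, g_minus = perturbed_densities(mid, eps)
        if g_plus - g_minus < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In this example the ratio `relative_entropy(eps) / eps**3` stays nearly constant as `eps` shrinks, while `(A_P - shifted_intersection(eps)) / eps` also stabilizes, so the intersection point moves on the scale `eps` although the entropy is only of order `eps**3`.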



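Lemma 4.1. Let $P\in\mathcal V$, and let $\mathcal U\subseteq\mathcal V$ be
an open neighborhood of $P$. Then there is $c_1 = c_1(P,\mathcal U) > 0$
such that for every $\delta\in(0,1/4]$ and for all large $n$ [say for
$n\ge n_0(P,\mathcal U,\delta)$], there is $Q_n\in\mathcal U$ such that

(4.2)  $nH(P,Q_n) \le \frac12|\log(11\delta)|$  and  $|a(P) - a(Q_n)| \ge \frac{4c_1|\log(11\delta)|^{1/3}}{n^{1/3}}.$

Proof. Choose a ball $\mathcal N\subseteq\mathcal U$ (with respect to the
metric $d$) centered at $P$. Let $r$ denote the radius of $\mathcal N$. Let
$c_1 > 0$ be small enough (to be specified below). Take
$\delta\in(0,1/4]$. We abbreviate

(4.3)  $\beta_n := \frac{n^{1/3}}{c_1|\log(11\delta)|^{1/3}} > 0.$

Take a fixed, compactly supported $\phi\in C^1(\mathbb R)$ with
$\phi(0) = 1$, $0\le\phi\le 1$, and $\|\phi'\|_\infty\|f_P\|_\infty < r$.
For $\varepsilon > 0$, we set

(4.4)  $\Xi_\varepsilon(x) := \varepsilon\,\phi\left(\frac{x - a(P)}{\varepsilon}\right).$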

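Recall the definitions of $f_P^+$ and $f_P^-$ in (1.1). We set

(4.5)  $\rho_P^\pm := \frac{f_P^\pm}{f_P^+ + f_P^-};$

at least in some compact neighborhood $V_P$ of $a(P)$, these functions are
well defined with values in $[1/3,2/3]$. For all small $\varepsilon > 0$,
$\Xi_\varepsilon$ is supported in such a neighborhood $V_P$. We set

(4.6)  $\varepsilon_n := c_4|\log(11\delta)|^{1/3}n^{-1/3},$

where $c_4 = c_4(P,\phi) := (\|f_P\|_\infty\|\phi\|_2^2)^{-1/3}$. Let $Q_n$
be defined by its density $f_{Q_n}$, where

(4.7)  $f_{Q_n}^\pm := (1\pm\Xi_{\varepsilon_n}\rho_P^\mp)f_P^\pm,$

(4.8)  $f_{Q_n}(x,y) := f_{Q_n}^+(x)1\{y=1\} + f_{Q_n}^-(x)1\{y=-1\}.$

ESTIMATION OF THRESHOLDS

Fig. 1. Perturbation of the two densities.

Figure 1 illustrates these definitions.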

Here $\Xi_{\varepsilon_n}\rho_P^\mp$ is to be interpreted as 0 outside the
support of $\Xi_{\varepsilon_n}$. Note that for all large $n$, $f_{Q_n}$ is
well defined. As a consequence of the assumptions (1.4), one sees that
$f_P^+(a(P)) = f_P^-(a(P)) > 0$. For large $n$, $\varepsilon_n$ is small;
thus $f_{Q_n}$ is nonnegative and $f_{Q_n}$ is a probability density.
Furthermore, using $\|\phi'\|_\infty\|f_P\|_\infty < r$ and
$|\rho_P^\mp f_P^\pm|\le\|f_P\|_\infty$, one sees that $d(Q_n,P) < r$ and,
thus, $Q_n\in\mathcal N$ holds for all large $n$. We calculate, for
$(x,y)\in\Omega$,

(4.9)  $\frac{dQ_n}{dP}(x,y) = 1 + \left(1\{y=1\}\rho_P^-(x) - 1\{y=-1\}\rho_P^+(x)\right)\Xi_{\varepsilon_n}(x).$



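For $|t-1|\le 1/2$, one has $-\log t + t - 1\le|t-1|^2$. So

(4.10)  $H(P,Q_n) = E_P\left[-\log\frac{dQ_n}{dP} + \frac{dQ_n}{dP} - 1\right] \le E_P\left[\left|\frac{dQ_n}{dP} - 1\right|^2\right]$

for $|\frac{dQ_n}{dP} - 1|\le 1/2$, which holds for all large $n$. Note
that $\|\Xi_{\varepsilon_n}\|_\infty = \varepsilon_n$. For all large $n$,
$\rho_P^\pm\in[1/3,2/3]$ holds on the support of $\Xi_{\varepsilon_n}$.
Then one has

(4.11)  $\frac13 \le |1\{y=1\}\rho_P^-(x) - 1\{y=-1\}\rho_P^+(x)| \le \frac23$

on the support of $\Xi_{\varepsilon_n}$, and therefore,

(4.12)  $\left|\frac{dQ_n}{dP}(x,y) - 1\right| \le \frac23|\Xi_{\varepsilon_n}(x)|.$

We get the following $O(\varepsilon^3)$ estimate for the relative entropy:

(4.13)  $nH(P,Q_n) \le \frac12nE_P[\Xi_{\varepsilon_n}^2(X)] \le \frac12n\|f_P\|_\infty\|\Xi_{\varepsilon_n}\|_2^2 \le \frac12n\|f_P\|_\infty\|\phi\|_2^2\varepsilon_n^3 \le \frac12|\log(11\delta)|$

by the choice (4.6) of $\varepsilon_n$.

F. MERKL AND L. MOHAMMADI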

[As a side remark, note that the estimate (4.13) relies on our choice to
take the Lebesgue measure $\lambda_{[0,1]}$ in the reference measure. In
some cases with arbitrary reference measures it would break down.]
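
The cubic power of $\varepsilon_n$ in (4.13) comes from a change of variables. As a worked step, assuming the perturbation has the form $\Xi_\varepsilon(x) = \varepsilon\,\phi((x - a(P))/\varepsilon)$ and that the norm is taken with respect to Lebesgue measure:

```latex
\[
\|\Xi_\varepsilon\|_2^2
= \int \varepsilon^2\,\phi\Bigl(\frac{x-a(P)}{\varepsilon}\Bigr)^2\,dx
= \varepsilon^3\int \phi(u)^2\,du
= \varepsilon^3\,\|\phi\|_2^2 ,
\]
\[
\text{so that}\quad
\tfrac12\,n\,\|f_P\|_\infty\,\|\phi\|_2^2\,\varepsilon_n^3 = \tfrac12\,|\log(11\delta)|
\quad\text{for}\quad
\varepsilon_n^3 = \frac{|\log(11\delta)|}{n\,\|f_P\|_\infty\,\|\phi\|_2^2}.
\]
```

With a reference measure that has atoms near $a(P)$, the first integral need not scale like $\varepsilon^3$, which is the point of the side remark.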

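On the other hand, defining $\rho_{Q_n}^\pm$ in analogy to (4.5), from
(4.7) and using $f_{Q_n}^+ + f_{Q_n}^- = f_P^+ + f_P^-$, it follows that

(4.14)  $\rho_{Q_n}^\pm = (1\pm\Xi_{\varepsilon_n}\rho_P^\mp)\rho_P^\pm.$

Using $\rho_P^-(a(P)) = \rho_P^+(a(P)) = \rho_{Q_n}^+(a(Q_n)) = 1/2$ and
(4.14), we get

(4.15)  $\frac{\varepsilon_n}{4} = \frac14|\Xi_{\varepsilon_n}(a(P))| = |\Xi_{\varepsilon_n}(a(P))\rho_P^-(a(P))\rho_P^+(a(P))|$
        $= |\rho_P^+(a(P)) - \rho_{Q_n}^+(a(P))| = |\rho_{Q_n}^+(a(Q_n)) - \rho_{Q_n}^+(a(P))|$
        $\le \|(\rho_{Q_n}^+)'\|_{\infty,V_P}\,|a(Q_n) - a(P)|.$

Taking the derivative of (4.14) and taking a supremum over $V_P$, we see
that $\|(\rho_{Q_n}^+)'\|_{\infty,V_P}$ is bounded by a constant
$C_5(P,\phi) > 0$ for all large $n$; note that
$\|\Xi_{\varepsilon_n}'\|_\infty = \|\phi'\|_\infty$ does not depend on $n$.
We obtain

(4.16)  $|a(Q_n) - a(P)| \ge \frac{\varepsilon_n}{4C_5(P,\phi)} = \frac{c_4|\log(11\delta)|^{1/3}n^{-1/3}}{4C_5(P,\phi)} \ge \frac{4c_1|\log(11\delta)|^{1/3}}{n^{1/3}} = \frac{4}{\beta_n}$

when we choose

(4.17)  $0 < c_1(P,\mathcal U) \le \frac{c_4(P,\phi)}{16C_5(P,\phi)};$

recall the choices (4.3) and (4.6) of $\beta_n$ and $\varepsilon_n$. The
statements (4.16) and (4.13) together are just the claim (4.2). □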

Proof of Theorem 1.1. Take a fixed $P\in\mathcal U$, and take
$c_1(\mathcal U) = c_1(P,\mathcal U) > 0$ from Lemma 4.1. Then Lemma 4.1
guarantees that the hypothesis (3.1) of Lemma 3.1 holds with $P_n = P$,
where $\beta_n$ is again given by (4.3). Thus Lemma 3.1 yields the claim
(1.5). □

Proof of Theorem 1.2. The class $\mathcal V$ with the metric $d$ is indeed
a Baire space, and $a:\mathcal V\to(0,1)$ is continuous. Note that the
total variation distance is continuous with respect to $d$.

We check that Lemma 3.2 is applicable with $\delta = 1/4$. Let
$P\in\mathcal V$ and let $\mathcal N$ be a neighborhood of $P$ in
$\mathcal V$. We apply Lemma 4.1 to obtain a sequence $(Q_n)_n$ in
$\mathcal N$ such that (4.2) holds. Hence, we get, for all large $n$ [say,
for $n\ge c_6(P,\mathcal N)$],

(4.18)  $\beta_n|a(P) - a(Q_n)| \ge \beta_n\frac{4c_1(P,\mathcal N)|\log(11\delta)|^{1/3}}{n^{1/3}} \ge 4$

by $n^{-1/3}\beta_n\to\infty$. Together with the entropy bound in (4.2),
this shows that Lemma 3.2 is indeed applicable, and it yields the claim
(1.6). □

Note that this proof would break down if we had taken $\beta_n$ on the
critical scale $\beta_n = \mathrm{const}\cdot n^{1/3}$. Indeed, the
constant $c_1(P,\mathcal U)$ depends on $\mathcal U$ (and it really
diverges as $\mathcal U$ gets smaller), but $\beta_n$ must not depend on
the choice of $\mathcal U$. This breakdown has to occur, since Theorem 1.3
shows that the claim (1.6) of Theorem 1.2 cannot hold any more on the
critical scale $\beta_n = \mathrm{const}\cdot n^{1/3}$.

5. Optimal estimators for thresholds. In this section we prove Theo- 
rem 1.3. The optimal estimators, whose errors asymptotically meet the bor- 
der permitted by the lower bounds, are constructed by a two-step procedure. 
In the first step (Lemma 5.1 below) we use the empirical risk minimizer to 
obtain a starting estimator, which yields error terms roughly on the scale
$n^{-1/3}$. The second step (Lemma 5.2 below) constructs a refined family
of estimators, based on a starting approximation $a_0$. Let
$(X_i,Y_i):\Omega^n\to\Omega$ again denote the canonical projections.

Lemma 5.1. There is a sequence of estimators $\hat a_n:\Omega^n\to[0,1]$,
$n\in\mathbb N$, such that, for all $P\in\mathcal V$, there is a
neighborhood $\mathcal N(P)$ of $P$ in $\mathcal V$, such that

(5.1)  $\lim_{L\to\infty}\ \sup_{Q\in\mathcal N(P)}\ \lim_{n\to\infty} Q^n[|\hat a_n - a(Q)| > Ln^{-1/3}] = 0.$

Proof. Consider the empirical risk minimizers $\hat a_n$,
$n\in\mathbb N$. We abbreviate $f_Q := f_Q^+ + f_Q^-$. Take
$P\in\mathcal V$ and

(5.2)  $\mathcal N(P) := \{Q\in\mathcal V : 2|(\rho_Q^+)'(a(Q))| > |(\rho_P^+)'(a(P))|,\ 2f_Q(a(Q)) > f_P(a(P))\},$

where $\rho_Q$ is again defined as in (4.5). By the transversality of the
intersection point $a(Q)$ of $f_Q^+$ and $f_Q^-$, the maps
$Q\mapsto(\rho_Q^+)'(a(Q))$ and $Q\mapsto f_Q(a(Q))$,
$Q\in\mathcal N(P)$, are continuous. Furthermore, $f_P(a(P)) > 0$ and
$|(\rho_P^+)'(a(P))| > 0$ hold. Using these facts, one sees that
$\mathcal N(P)$ is a neighborhood of $P$. Take $Q\in\mathcal N(P)$. By
Theorem 2.2 in [13], we know that

(5.3)  $n^{1/3}(\hat a_n - a(Q)) \to [(\rho_Q^+)'(a(Q))\sqrt{f_Q(a(Q))}]^{-2/3}Z$  in distribution

as $n\to\infty$ for some continuous random variable $Z$ not depending on
$Q$ (with respect to some probability measure $\mathbb P$). We set

(5.4)  $\alpha := \inf_{Q\in\mathcal N(P)}[(\rho_Q^+)'(a(Q))\sqrt{f_Q(a(Q))}]^{2/3}.$

Note that $\alpha > 0$ by the choice (5.2) of $\mathcal N(P)$. We now
obtain the claim (5.1):

(5.5)  $\sup_{Q\in\mathcal N(P)}\ \lim_{n\to\infty} Q^n[|\hat a_n - a(Q)| > Ln^{-1/3}] = \sup_{Q\in\mathcal N(P)}\mathbb P[|Z| > [(\rho_Q^+)'(a(Q))\sqrt{f_Q(a(Q))}]^{2/3}L] \le \mathbb P[|Z| > \alpha L] \to 0$  as $L\to\infty$. □

The next lemma is used to construct the refined estimators in Theorem 1.3.
Here is the idea. Given a starting approximation $a_0$ for $a(Q)$ with
an error on the scale $n^{-1/3}$, consider all data points in a neighborhood of
size $Ln^{-1/3}$, where $L$ is large, but fixed. Then construct a regression line
through the data points in this neighborhood, and take the intersection of
this regression line with the x-axis as the refined estimator.
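
The idea just described can be illustrated by a short Python sketch. It is only a minimal illustration under stated assumptions, not the construction of the proof verbatim: the function name `refine_threshold`, the encoding of the labels as numbers (for instance +1 and -1), and the use of plain least squares via the normal equations are choices made here. The value `math.inf` is returned when the regression line is not well defined or has slope zero.

```python
import math


def refine_threshold(xs, ys, a0, n, L):
    """Regression-line refinement of a starting value a0 (illustrative sketch).

    xs, ys : observed points (X_j, Y_j); the labels are assumed to be numeric.
    a0     : starting approximation of the threshold.
    n, L   : sample size and window constant; the window is M = L * n**(-1/3).
    """
    M = L * n ** (-1.0 / 3.0)
    # Keep only the points with |X_j - a0| <= M, shifted by a0.
    pts = [(x - a0, y) for x, y in zip(xs, ys) if abs(x - a0) <= M]
    # The line is well defined only if two different abscissae are present.
    if len({x for x, _ in pts}) < 2:
        return math.inf
    # Normal equations of least squares for the line y = b1 * x + b2.
    sxx = sum(x * x for x, _ in pts)
    sx = sum(x for x, _ in pts)
    cnt = len(pts)
    sxy = sum(x * y for x, y in pts)
    sy = sum(y for _, y in pts)
    det = sxx * cnt - sx * sx
    if det == 0:  # guard against floating-point degeneracy
        return math.inf
    b1 = (sxy * cnt - sx * sy) / det
    b2 = (sxx * sy - sx * sxy) / det
    if b1 == 0:
        return math.inf
    # Intersection of the line with the x-axis, shifted back by a0.
    return a0 - b2 / b1
```

For points lying exactly on a line that crosses zero at 0.4, the sketch returns 0.4 up to rounding, whatever starting value inside the window is used.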

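Lemma 5.2. There is a family of estimators
$(\hat a_{n,L,a_0}\colon \Omega^n \to \mathbb{R}\cup\{\infty\})_{n\in\mathbb{N},\,L>0,\,a_0\in(0,1)}$
with the following property. For all $P\in\mathcal{P}$, there is a neighborhood $\mathcal{N}$
of $P$, such that, for all $T > 0$, one has

(5.6)  $\lim_{L\to\infty}\ \sup_{Q\in\mathcal{N}}\ \limsup_{n\to\infty}\ \sup_{\substack{a_0\in(0,1)\\ |a_0-a(Q)|\le Ln^{-1/3}}} Q^n\bigl[\,n^{1/3}|\hat a_{n,L,a_0} - a(Q)| > T\,\bigr] = 0.$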

Proof. Given $n\in\mathbb{N}$, $L > 0$, and $a_0\in(0,1)$, we introduce the abbreviations
$\tilde X_i := X_i - a_0$, $M := Ln^{-1/3}$ and $I_M := [-M, M]$, and we define the
random set

(5.7)  $J := \{i : 1\le i\le n,\ \tilde X_i\in I_M\}.$

We define the estimator $\hat a_{n,L,a_0}$ as follows: Consider the regression line $\ell$
(equation $y = b_1x + b_2$) through the points $(\tilde X_j, Y_j)$, $j\in J$, provided it is
well defined, that is, provided there are at least two different $\tilde X_{j_1}\ne\tilde X_{j_2}$,
$j_1, j_2\in J$. If $b_1\ne 0$, we set

(5.8)  $\hat a_{n,L,a_0} := a_0 - \dfrac{b_2}{b_1}.$

Geometrically this means that $\hat a_{n,L,a_0}$ is the intersection of the real axis with
the regression line through the points $(X_j, Y_j)$, where only the points with
$|X_j - a_0|\le Ln^{-1/3}$ are taken, whenever this intersection is well defined. If
the regression line $\ell$ is not well defined, or if $b_1 = 0$, we set $\hat a_{n,L,a_0} = \infty$, just
to have a definite value in this case too.
We abbreviate, for $Q\in\mathcal{P}$,

(5.9)  $s_Q := f_Q(a(Q)), \qquad t_Q := (f_Q^+)'(a(Q)) - (f_Q^-)'(a(Q)),$

where again $f_Q := f_Q^+ + f_Q^-$. Let $P\in\mathcal{P}$, and take the following neighborhood
of $P$:



^D.iUj yv .- fc / . ^^^^ ^ ^^^^^2, 1 _ a(Q) > (1 - a[s 



(P))/2}- 

Take $T > 0$ and $\delta > 0$. Let $L$ be large enough [more specifically, so large that

(5.11) 2L>T, l3>s2:=^, 

hold with some positive constants $c_7 = c_7(\mathcal{N})$ and $c_8 = c_8(\mathcal{N})$, to be specified
below]. We claim that, for all $Q\in\mathcal{N}$, one has

(5.12)  $\limsup_{n\to\infty}\ \sup_{\substack{a_0\in(0,1)\\ |a_0-a(Q)|\le Ln^{-1/3}}} Q^n\bigl[\,n^{1/3}|\hat a_{n,L,a_0} - a(Q)| > T\,\bigr] \le \delta.$

The claim (5.6) of the lemma then follows immediately from the statement
(5.12).

Here is a sketch of the proof of (5.12). For a complete proof, see [12].
To prove (5.12), let $Q\in\mathcal{N}$. For every $n\in\mathbb{N}$ and for every $a_0\in(0,1)$ with

(5.13)  $|a_0 - a(Q)| \le Ln^{-1/3},$

we are going to define an event $B = B(Q,n,a_0,\delta,L)\subseteq\Omega^n$ with the property

(5.14)  $Q^n[B(Q,n,a_0,\delta,L)] \ge 1-\delta.$

[This is done in (5.22) below.] We then show that, for all large $n$ [say, for
$n > n_0(Q,\mathcal{N},\delta,L)$, but uniformly in the choice of $a_0$], one has

(5.15)  $B(Q,n,a_0,\delta,L) \subseteq \bigl\{n^{1/3}|\hat a_{n,L,a_0} - a(Q)| \le T\bigr\}.$

Once we have proven (5.14) and (5.15), the claim (5.12) is an immediate
consequence.
Take $Q\in\mathcal{N}$ and $a_0\in(0,1)$ with the constraint (5.13). It is convenient to
shift quantities by $a_0$. We set

(5.16)  $\tilde a := \tilde a(Q) = a(Q) - a_0,$

(5.17)  $\tilde f_Q^{\pm}(x) := f_Q^{\pm}(x + a_0), \qquad \tilde f_Q(x) := f_Q(x + a_0).$

In particular, note that $|\tilde a|\le M = Ln^{-1/3}$ and that for all large $n$ [uniformly
in $a_0$ and $Q\in\mathcal{N}$ with the constraint (5.13)], $\tilde f_Q$ is defined at least on $I_M$.

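The coefficients $b_1$ and $b_2$ of the regression line $\ell$ are determined by the
linear system $Ab = c$, where

(5.18)
$$A := \begin{pmatrix} \sum_{j\in J}\tilde X_j^2 & \sum_{j\in J}\tilde X_j \\ \sum_{j\in J}\tilde X_j & |J| \end{pmatrix}, \qquad
b := \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}, \qquad
c := \begin{pmatrix} \sum_{j\in J}\tilde X_j Y_j \\ \sum_{j\in J} Y_j \end{pmatrix}.$$

We introduce the (normalized) difference of the coefficient matrix from its
expected value,

(5.19)
$$\Delta := \frac{1}{\sqrt n}\begin{pmatrix} M^{-5/2}\,(A_{11} - E_{Q^n}[A_{11}]) & M^{-3/2}\,(A_{12} - E_{Q^n}[A_{12}]) \\ M^{-3/2}\,(A_{21} - E_{Q^n}[A_{21}]) & M^{-1/2}\,(A_{22} - E_{Q^n}[A_{22}]) \end{pmatrix},$$

(5.20)
$$\Gamma := \frac{1}{\sqrt n}\begin{pmatrix} M^{-3/2}\,(c_1 - E_{Q^n}[c_1]) \\ M^{-1/2}\,(c_2 - E_{Q^n}[c_2]) \end{pmatrix}.$$

Our reason to normalize the $\Delta_{ij}$ and $\Gamma_j$ in this way is the following bound
on the variances (to be read element-wise):

(5.21)
$$\operatorname{Var}_{Q^n}\begin{pmatrix} \Delta_{11} & \Delta_{12} \\ \Delta_{21} & \Delta_{22} \end{pmatrix}
\le E_Q\!\left[\mathbf{1}\{\tilde X_1\in I_M\}\begin{pmatrix} \tilde X_1^4/M^5 & \tilde X_1^2/M^3 \\ \tilde X_1^2/M^3 & 1/M \end{pmatrix}\right]
\le \begin{pmatrix} c_7 & c_7 \\ c_7 & c_7 \end{pmatrix},$$
$$\operatorname{Var}_{Q^n}\begin{pmatrix} \Gamma_1 \\ \Gamma_2 \end{pmatrix}
\le E_Q\!\left[\mathbf{1}\{\tilde X_1\in I_M\}\begin{pmatrix} \tilde X_1^2/M^3 \\ 1/M \end{pmatrix}\right]
\le \begin{pmatrix} c_7 \\ c_7 \end{pmatrix},$$

with some constant $c_7 = c_7(\mathcal{N}) > 0$. We have used the fact that the density
$f_Q$ of $(X_i, Y_i)$ is uniformly bounded for $Q\in\mathcal{N}$.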




Here is the definition of the event $B$:

(5.22)  $B(Q,n,a_0,\delta,L) := \bigl\{|\Delta_{ij}|\le s,\ |\Gamma_i|\le s\ \ (i,j\in\{1,2\})\bigr\}.$

Chebyshev's inequality, (5.21) and the choice (5.11) of $s$ imply the claim
(5.14),

(5.23)  $Q^n\bigl[B(Q,n,a_0,\delta,L)^c\bigr] \le \dfrac{5c_7}{s^2} = \delta.$

The factor 5 arises since there are five random variables involved (recall
$\Delta_{12} = \Delta_{21}$).

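In the rest of this proof we verify the claim (5.15) for all large $n$ (uniformly
in $a_0$). So assume that the event $B(Q,n,a_0,\delta,L)$ holds.
The system $Ab = c$ is equivalent to

(5.24)
$$\begin{pmatrix}
\displaystyle\int_{-M}^{M} x^2\tilde f_Q(x)\,dx + \frac{M^{5/2}\Delta_{11}}{\sqrt n} &
\displaystyle\int_{-M}^{M} x\tilde f_Q(x)\,dx + \frac{M^{3/2}\Delta_{12}}{\sqrt n} \\[2ex]
\displaystyle\int_{-M}^{M} x\tilde f_Q(x)\,dx + \frac{M^{3/2}\Delta_{21}}{\sqrt n} &
\displaystyle\int_{-M}^{M} \tilde f_Q(x)\,dx + \frac{M^{1/2}\Delta_{22}}{\sqrt n}
\end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
=
\begin{pmatrix}
\displaystyle\int_{-M}^{M} x\bigl(\tilde f_Q^+(x) - \tilde f_Q^-(x)\bigr)\,dx + \frac{M^{3/2}\Gamma_1}{\sqrt n} \\[2ex]
\displaystyle\int_{-M}^{M} \bigl(\tilde f_Q^+(x) - \tilde f_Q^-(x)\bigr)\,dx + \frac{M^{1/2}\Gamma_2}{\sqrt n}
\end{pmatrix}.$$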



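Let us introduce some notation used in the Taylor approximations below.
The variables $\xi_j = \xi_j(x, \tilde a, Q, a_0)$ denote some values between $x$ and $\tilde a$. The
variables $\varepsilon_j$ denote error terms which are bounded by $\mathcal{N}$-dependent constants
$|\varepsilon_j|\le\mathrm{const}(\mathcal{N})$, and $\delta_j$ denote error terms which are bounded by
$|\delta_j|\le\mathrm{const}\cdot\sigma(Q,2M)$, where

(5.25)  $\sigma(Q,r) := \max_{y}\ \sup_{|x_1-x_2|\le r} \bigl|\partial_1 f_Q(x_1,y) - \partial_1 f_Q(x_2,y)\bigr|$

denotes the modulus of continuity of $\partial_1 f_Q$; recall that $f_Q\in C^1(\Omega)$. Recall
that $|\tilde a|\le M$. Let us approximate the integrals in (5.24) by Taylor's formula,

(5.26)
$$\int_{-M}^{M} x^2\tilde f_Q(x)\,dx = \int_{-M}^{M} x^2\bigl[s_Q + (x-\tilde a)(\tilde f_Q)'(\xi_1)\bigr]\,dx = \tfrac{2}{3}M^3 s_Q + \varepsilon_1 M^4,$$

(5.27)
$$\int_{-M}^{M} x\tilde f_Q(x)\,dx = \int_{-M}^{M} x\bigl[s_Q + (x-\tilde a)(\tilde f_Q)'(\xi_2)\bigr]\,dx = \varepsilon_2 M^3,$$

(5.28)
$$\int_{-M}^{M} \tilde f_Q(x)\,dx = \int_{-M}^{M} \bigl[s_Q + (x-\tilde a)(\tilde f_Q)'(\xi_3)\bigr]\,dx = 2Ms_Q + \varepsilon_3 M^2,$$

(5.29)
$$\int_{-M}^{M} x\bigl(\tilde f_Q^+(x) - \tilde f_Q^-(x)\bigr)\,dx = \int_{-M}^{M} x(x-\tilde a)\bigl((\tilde f_Q^+)'(\xi_4) - (\tilde f_Q^-)'(\xi_4)\bigr)\,dx = \tfrac{2}{3}M^3 t_Q + \delta_1 M^3 = \varepsilon_4 M^3,$$

(5.30)
$$\int_{-M}^{M} \bigl(\tilde f_Q^+(x) - \tilde f_Q^-(x)\bigr)\,dx = \int_{-M}^{M} (x-\tilde a)\bigl((\tilde f_Q^+)'(\xi_5) - (\tilde f_Q^-)'(\xi_5)\bigr)\,dx = -2\tilde a M t_Q + \delta_2 M^2 = \varepsilon_5 M^2.$$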



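We rewrite the system (5.24) as

(5.31)
$$\begin{pmatrix}
\tfrac{2}{3}M^3 s_Q + \varepsilon_1 M^4 + \dfrac{M^{5/2}\Delta_{11}}{\sqrt n} &
\varepsilon_2 M^3 + \dfrac{M^{3/2}\Delta_{12}}{\sqrt n} \\[2ex]
\varepsilon_2 M^3 + \dfrac{M^{3/2}\Delta_{21}}{\sqrt n} &
2Ms_Q + \varepsilon_3 M^2 + \dfrac{M^{1/2}\Delta_{22}}{\sqrt n}
\end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
=
\begin{pmatrix}
\tfrac{2}{3}M^3 t_Q + \delta_1 M^3 + \dfrac{M^{3/2}\Gamma_1}{\sqrt n} \\[2ex]
-2\tilde a M t_Q + \delta_2 M^2 + \dfrac{M^{1/2}\Gamma_2}{\sqrt n}
\end{pmatrix}
=
\begin{pmatrix}
\varepsilon_4 M^3 + \dfrac{M^{3/2}\Gamma_1}{\sqrt n} \\[2ex]
\varepsilon_5 M^2 + \dfrac{M^{1/2}\Gamma_2}{\sqrt n}
\end{pmatrix};$$

both forms of the right-hand side are useful below. Dividing the first row in
(5.31) by $(2/3)M^3 s_Q$ and the second row by $2Ms_Q$, we get the normalized
system

(5.32)
$$\begin{pmatrix}
1 + M\Bigl[\dfrac{3\varepsilon_1}{2s_Q} + \dfrac{3}{2s_Q}\dfrac{\Delta_{11}}{L^{3/2}}\Bigr] &
\dfrac{3\varepsilon_2}{2s_Q} + \dfrac{3}{2s_Q}\dfrac{\Delta_{12}}{L^{3/2}} \\[2ex]
M^2\Bigl[\dfrac{\varepsilon_2}{2s_Q} + \dfrac{1}{2s_Q}\dfrac{\Delta_{21}}{L^{3/2}}\Bigr] &
1 + M\Bigl[\dfrac{\varepsilon_3}{2s_Q} + \dfrac{1}{2s_Q}\dfrac{\Delta_{22}}{L^{3/2}}\Bigr]
\end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
=
\frac{t_Q}{s_Q}
\begin{pmatrix}
1 + \dfrac{3\delta_1}{2t_Q} + \dfrac{3}{2t_Q}\dfrac{\Gamma_1}{L^{3/2}} \\[2ex]
-\tilde a + M\Bigl[\dfrac{\delta_2}{2t_Q} + \dfrac{1}{2t_Q}\dfrac{\Gamma_2}{L^{3/2}}\Bigr]
\end{pmatrix}
=
\begin{pmatrix}
\varepsilon_8 + \varepsilon_9\dfrac{\Gamma_1}{L^{3/2}} \\[2ex]
M\Bigl[\varepsilon_{10} + \varepsilon_{11}\dfrac{\Gamma_2}{L^{3/2}}\Bigr]
\end{pmatrix}.$$

Here we have used $M^{3/2}\sqrt n = L^{3/2}$.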
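Heuristically, one should view (5.32) as a perturbation of the system

(5.33)
$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \frac{t_Q}{s_Q}\begin{pmatrix} 1 \\ -\tilde a \end{pmatrix},$$

for which one knows $b_2/b_1 = -\tilde a$.

By (5.11) and the definition (5.22) of the event $B$, we know, for $i,j\in\{1,2\}$,

$$\frac{|\Delta_{ij}|}{L^{3/2}} \le \frac{s}{L^{3/2}} \le 1 \qquad\text{and}\qquad \Bigl|\frac{3}{2t_Q}\frac{\Gamma_i}{L^{3/2}}\Bigr| \le c_8\frac{s}{L^{3/2}} \le \frac{T}{5L},$$

when we choose the constant in (5.11) to be
$c_8 = c_8(\mathcal{N}) = \sup_{Q\in\mathcal{N}}|3/(2t_Q)| \le 3|t_P|^{-1} < \infty$;
recall the definition (5.10) of $\mathcal{N}$, and recall that we assume the event $B$
holds.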

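For $M\le 1$, we rewrite (5.32) in the form

(5.34)
$$\begin{pmatrix} 1 + M\varepsilon_{12} & \varepsilon_{13} \\ M^2\varepsilon_{14} & 1 + M\varepsilon_{15} \end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
= \frac{t_Q}{s_Q}\begin{pmatrix} 1 + \varepsilon_{16}\delta_1 + \varepsilon_{17}T/L \\ -\tilde a + M[\varepsilon_{18}\delta_2 + \varepsilon_{19}T/L] \end{pmatrix}
= \frac{t_Q}{s_Q}\begin{pmatrix} \varepsilon_{20} \\ M\varepsilon_{21} \end{pmatrix},$$

where $|\varepsilon_{19}|\le 1/5$ and $|\varepsilon_{17}|\le 1/5$.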

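Let us consider the asymptotics of $b_2/b_1$ as $n\to\infty$, that is, as
$M = Ln^{-1/3}\to 0$. For large $n$ (uniformly in $a_0$), the system (5.34) is nonsingular;
recall that the error terms $\varepsilon_j$ are bounded uniformly in $a_0$. We get,
for all large $n$,

(5.35)
$$\frac{b_2}{b_1} = \frac{\det\begin{pmatrix} 1 + M\varepsilon_{12} & \varepsilon_{20} \\ M^2\varepsilon_{14} & -\tilde a + M[\varepsilon_{18}\delta_2 + \varepsilon_{19}T/L] \end{pmatrix}}{\det\begin{pmatrix} 1 + \varepsilon_{16}\delta_1 + \varepsilon_{17}T/L & \varepsilon_{13} \\ M\varepsilon_{21} & 1 + M\varepsilon_{15} \end{pmatrix}}.$$

Let $d_2$ and $d_1$ denote the determinants in the numerator and denominator
of the right-hand side in (5.35), respectively. Recall that the error terms $\delta_2$,
$\delta_1$ converge to 0 as $n\to\infty$, uniformly in $a_0$; see (5.25). Using $|\varepsilon_{19}|, |\varepsilon_{17}|\le 1/5 < 1/4$,
we get, for all large $n$ (uniformly in $a_0$),

(5.36)  $|d_2 + \tilde a| \le \dfrac{MT}{4L}, \qquad |1 - d_1| \le \dfrac{T}{4L}, \qquad \Bigl|\dfrac{1}{d_1} - 1\Bigr| \le \dfrac{T}{2L}.$

We have used $T\le 2L$ in the last step.

Using the definition (5.8) of $\hat a_{n,L,a_0}$ and (5.35), we conclude, still for large
$n$ (uniformly in $a_0$),

(5.37)
$$|\hat a_{n,L,a_0} - a(Q)| = \Bigl|\frac{b_2}{b_1} + \tilde a\Bigr| = \Bigl|\frac{d_2}{d_1} + \tilde a\Bigr|
\le |\tilde a|\frac{T}{2L} + \frac{MT}{4L} + \frac{MT}{4L}\cdot\frac{T}{2L}
\le \frac{MT}{L} = Tn^{-1/3};$$

recall that $|\tilde a|\le M$ and $T\le 2L$. Thus, the claim (5.15) holds for all large $n$,
uniformly in $a_0$. $\Box$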

We now use Lemma 5.1 to obtain the starting approximation required 
by Lemma 5.2. 

Proof of Theorem 1.3. Let us abbreviate $Z_j := (X_j, Y_j)$. Let $n\in\mathbb{N}$,
$n\ge 2$, and $L > 0$. We construct the estimator $\hat a_{n,L}$ by a two-step procedure.
We split the sample $Z_1,\dots,Z_n$ into two halves $Z_1,\dots,Z_m$ and $Z_{m+1},\dots,Z_{2m}$,
where we abbreviate $m = m(n) := \lfloor n/2\rfloor$. (For odd $n$, we drop one data point
at the end.) We then use the first half of the data to get a rough estimate
$\hat a_m = \hat a_m(Z_1,\dots,Z_m)$ by Lemma 5.1. Using this as a starting estimate, we
refine it by Lemma 5.2, applied to the second half of the data:

(5.38)  $\hat a_{n,L} := \hat a_{m,L,\hat a_m}(Z_{m+1},\dots,Z_{2m}).$

It is important to split the data into two disjoint pieces, since then $\hat a_m$ is
independent of $(Z_{m+1},\dots,Z_{2m})$; thus, we have good control of the distribution
of $\hat a_{n,L}$ conditioned on $\hat a_m(Z_1,\dots,Z_m)$. By a slight abuse of notation, we abbreviate
$\hat a_m := \hat a_m(Z_1,\dots,Z_m)$ and $\hat a_{m,L,\hat a_m} := \hat a_{m,L,\hat a_m}(Z_{m+1},\dots,Z_{2m})$.
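
The sample splitting in (5.38) can be sketched in a few lines of Python. This is only an illustration of the data flow, under assumptions made here: the names `two_step_estimate`, `rough` and `refine` are hypothetical, `rough` stands for any starting estimator in the sense of Lemma 5.1, and `refine` for any refinement in the sense of Lemma 5.2. The point of the sketch is that the two steps never see the same observation.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]  # one observation Z_j = (X_j, Y_j)


def two_step_estimate(
    sample: Sequence[Point],
    rough: Callable[[Sequence[Point]], float],
    refine: Callable[[Sequence[Point], float, float], float],
    L: float,
) -> float:
    """Two-step estimator built from disjoint halves of the sample (sketch).

    rough(first_half)           -> starting value (role of Lemma 5.1)
    refine(second_half, a0, L)  -> refined value  (role of Lemma 5.2)
    """
    m = len(sample) // 2          # m = floor(n / 2)
    first = sample[:m]            # Z_1, ..., Z_m
    second = sample[m:2 * m]      # Z_{m+1}, ..., Z_{2m}; for odd n the last point is dropped
    a_start = rough(first)        # depends on the first half only
    return refine(second, a_start, L)
```

Because `rough` only receives the first half and `refine` only the second, the starting value is independent of the data used for the refinement, which is the property used in the remainder of the proof.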



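Let $P\in\mathcal{P}$, and let $\mathcal{N}$ be the intersection of the two neighborhoods of $P$
that were constructed in Lemmas 5.1 and 5.2. Let $T > 0$ and $\delta > 0$. By the
same two lemmas, we know, for all large $L$,

(5.39)  $\sup_{Q\in\mathcal{N}}\ \limsup_{n\to\infty}\ Q^n\bigl[A(Q,n,L)^c\bigr] < \dfrac{\delta}{2}$

with the events

(5.40)  $A(Q,n,L) := \bigl\{|\hat a_m - a(Q)| \le Lm^{-1/3}\bigr\}$

and

(5.41)  $\sup_{Q\in\mathcal{N}}\ \limsup_{n\to\infty}\ \sup_{\substack{a_0\in(0,1)\\ |a_0-a(Q)|\le Lm^{-1/3}}} Q^m\Bigl[\,m^{1/3}|\hat a_{m,L,a_0} - a(Q)| > \dfrac{T}{2}\Bigr] < \dfrac{\delta}{2}.$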



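Note that, for all large $n$ [say, for $n > n_0(Q,L)$], we have $\hat a_m\in(0,1)$ on
the event $A(Q,n,L)$. Let $Q\in\mathcal{N}$, and let $n\in\mathbb{N}$ be large enough. Using
$n^{1/3}\le 2m^{1/3}$ in the first step, we get

(5.42)
$$\begin{aligned}
Q^n\bigl[\,n^{1/3}|\hat a_{n,L} - a(Q)| > T\,\bigr]
&\le Q^n\Bigl[\,m^{1/3}|\hat a_{n,L} - a(Q)| > \frac{T}{2}\Bigr] \\
&\le E_{Q^n}\Bigl[\,Q^n\Bigl[\,m^{1/3}|\hat a_{m,L,\hat a_m} - a(Q)| > \frac{T}{2}\ \Big|\ \hat a_m\Bigr]\,\mathbf{1}(A(Q,n,L))\Bigr] + Q^n\bigl[A(Q,n,L)^c\bigr] \\
&\le E_{Q^n}\Bigl[\,Q^n\Bigl[\,m^{1/3}|\hat a_{m,L,\hat a_m} - a(Q)| > \frac{T}{2}\ \Big|\ \hat a_m\Bigr]\,\mathbf{1}(A(Q,n,L))\Bigr] + \frac{\delta}{2}.
\end{aligned}$$

Using the independence structure and (5.41), we know for almost all $a_0$ with
$|a_0 - a(Q)|\le Lm^{-1/3}$ (almost all with respect to the law of $\hat a_m$)

(5.43)
$$Q^n\Bigl[\,m^{1/3}|\hat a_{m,L,\hat a_m} - a(Q)| > \frac{T}{2}\ \Big|\ \hat a_m = a_0\Bigr]
= Q^m\Bigl[\,m^{1/3}|\hat a_{m,L,a_0} - a(Q)| > \frac{T}{2}\Bigr] \le \frac{\delta}{2}.$$

Combining (5.42) and (5.43), we conclude $Q^n[n^{1/3}|\hat a_{n,L} - a(Q)| > T]\le\delta$.
This finishes the proof of the claim (1.7). $\Box$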

Let us finally describe a simple counterexample, showing that the $\limsup_{n\to\infty}$
in the claim (1.6) of Theorem 1.2 cannot be replaced by $\liminf_{n\to\infty}$.

Let us take $\hat a_n$ to be constant estimators in the following way. For all
$k\in\mathbb{N}_0$ and $n\in[2^k, 2^{k+1}[\,\cap\,\mathbb{N}$, set $\hat a_n := (n - 2^k)2^{-k}$. Then, whatever the
value $a(P)\in[0,1]$ is, we have the following: For all $k\in\mathbb{N}_0$, there is
$n\in[2^k, 2^{k+1}[\,\cap\,\mathbb{N}$ with $|\hat a_n - a(P)|\le 2^{-k}\le 2/n$. But then

(5.44)  $\liminf_{n\to\infty} P^n\bigl[\beta_n|\hat a_n - a(P)| > 1\bigr] = 0$

whenever $\beta_n = o(n)$ as $n\to\infty$.

Intuitively speaking, the counterexample uses the following idea. If you 
have many nonrunning clocks, one for every minute of the day, all showing 
different times, then one of them will show the correct time, up to 1 minute. 
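
The counterexample can be checked numerically with a few lines of Python. The sketch below is only an illustration; the function names `constant_estimator` and `best_in_block` are chosen here, and exact rational arithmetic is used so that the inequalities are not blurred by rounding.

```python
from fractions import Fraction


def constant_estimator(n: int) -> Fraction:
    """Value of the constant estimator for sample size n >= 1.

    With k such that 2**k <= n < 2**(k+1), the value is (n - 2**k) * 2**(-k),
    so within one dyadic block the values run through a grid of mesh 2**(-k).
    """
    k = n.bit_length() - 1
    return Fraction(n - 2 ** k, 2 ** k)


def best_in_block(a: Fraction, k: int) -> int:
    """Sample size n in [2**k, 2**(k+1)) whose constant value is closest to a."""
    block = range(2 ** k, 2 ** (k + 1))
    return min(block, key=lambda n: abs(constant_estimator(n) - a))


# For every target a in [0, 1] and every block k there is an n in the block
# with |a_n - a| <= 2**(-k) <= 2 / n.
for a in (Fraction(0), Fraction(1, 3), Fraction(5, 7), Fraction(1)):
    for k in range(10):
        n = best_in_block(a, k)
        err = abs(constant_estimator(n) - a)
        assert err <= Fraction(1, 2 ** k) <= Fraction(2, n)
```

The loop confirms the clock picture: along a suitable subsequence of sample sizes the error is of order $1/n$, although the estimators do not use the data at all.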

6. Asymptotic bounds for the classification error. In this section we 
present the proofs of Corollaries 1.4, 1.5 and 1.6. The proofs depend on 
the following lemma, which is based on a Taylor expansion. 

Lemma 6.1. For all $Q\in\mathcal{P}$, there is a neighborhood $\mathcal{U}$ of $Q$ in $\mathcal{P}$, and
there are positive constants $c_9 = c_9(\mathcal{U})$, $c_3 = c_3(\mathcal{U})$ and $c_{10} = c_{10}(\mathcal{U})$, such
that, for all $P\in\mathcal{U}$ and for all $a\in\mathbb{R}$,

(6.1)  $\min\{c_9,\ c_3(a(P) - a)^2\} \le L_P(a) - L_P(a(P)) \le c_{10}(a(P) - a)^2.$
Proof. For P eV, we set mp := - fp. Note that mp{a{P)) = 0. 
Furthermore, for all $a\in(0,1)$,

(6.2)  $L_P(a) - L_P(a(P)) = \displaystyle\int_{a(P)}^{a} m_P(x)\,dx = \int_{a(P)}^{a} m_P'(x)(a - x)\,dx.$

Since $m_P'$ (in the $\|\cdot\|_\infty$-norm) and $a(P)$ depend continuously on $P$, the fact
$m_Q'(a(Q)) > 0$ implies the following for some neighborhood $\mathcal{U}$ of $Q$ and some
$\epsilon = \epsilon(\mathcal{U}) > 0$. For all $P\in\mathcal{U}$, one has $[a(P) - \epsilon, a(P) + \epsilon]\subseteq(0,1)$,

(6.3)  $c_3 := \dfrac{1}{2}\inf_{P\in\mathcal{U}}\ \inf_{x\in[a(P)-\epsilon,\,a(P)+\epsilon]} m_P'(x) > 0 \qquad\text{and}\qquad c_{10} := \dfrac{1}{2}\sup_{P\in\mathcal{U}}\|m_P'\|_\infty < \infty.$

Thus, we get the upper bound in (6.1). Moreover, for all $a\in(0,1)$ and
$P\in\mathcal{U}$ with $|a - a(P)|\le\epsilon$, we have $c_3(a(P) - a)^2\le L_P(a) - L_P(a(P))$.
Since $\operatorname{sign}(m_P(x)) = \operatorname{sign}(x - a(P))$ holds for all $x\in[0,1]$, we have
$L_P(a)\ge L_P(a(P) - \epsilon)$ for $a\le a(P) - \epsilon$, and $L_P(a)\ge L_P(a(P) + \epsilon)$ for $a\ge a(P) + \epsilon$.
Hence, we get the lower bound in (6.1) with $c_9 := c_3\epsilon^2$. $\Box$

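Proof of Corollaries 1.4 and 1.5. Consider an arbitrary sequence
$\gamma_n$ with $\gamma_n\to\infty$ as $n\to\infty$. Then, it follows from the lower bound in (6.1) that, for
$c_3T^2\le c_9\gamma_n^2$, that is, for all large $n$, we have

(6.4)  $\{\gamma_n|\hat a_n - a(P)| > T\} \subseteq \{\gamma_n^2(L_P(\hat a_n) - L_P(a(P))) > c_3T^2\}$

for all $P\in\mathcal{U}$. Take any $U$ and set $T := (U/c_3)^{1/2}$. In view of (6.4),

$$\liminf_{n\to\infty}\ \sup_{P\in\mathcal{U}} P^n\bigl[\gamma_n^2(L_P(\hat a_n) - L_P(a(P))) > U\bigr]
\ \ge\ \liminf_{n\to\infty}\ \sup_{P\in\mathcal{U}} P^n\bigl[\gamma_n|\hat a_n - a(P)| > T\bigr].$$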

Now, Corollary 1.4 follows from Theorem 1.1, taking ?7 := S*, C2 := cf C3 and 
7„ :=ni/3. 

Similarly, Corollary 1.5 is a consequence of Theorem 1.2, by taking U : = 
and 7„ :=/?„. □ 

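Proof of Corollary 1.6. From the upper bound in (6.1), it follows
that

(6.5)  $\{n^{2/3}(L_P(\hat a_n) - L_P(a(P))) > T\} \subseteq \bigl\{n^{1/3}|\hat a_n - a(P)| > \sqrt{T/c_{10}}\bigr\}.$

Now, we have

(6.6)
$$\begin{aligned}
&\lim_{L\to\infty}\ \sup_{Q\in\mathcal{N}}\ \limsup_{n\to\infty}\ Q^n\bigl[n^{2/3}(L_Q(\hat a_{n,L}) - L_Q(a(Q))) > T\bigr] \\
&\qquad\le \lim_{L\to\infty}\ \sup_{Q\in\mathcal{N}}\ \limsup_{n\to\infty}\ Q^n\bigl[n^{1/3}|\hat a_{n,L} - a(Q)| > \sqrt{T/c_{10}}\bigr] = 0,
\end{aligned}$$

where we have used the claim (1.7) of Theorem 1.3 in the last step. $\Box$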
7. General theory for higher dimensions. In this section we explain
briefly how one can generalize our theory to higher dimensions. The full
discussion of the generalization is beyond the scope of this paper. Consider
a statistical model $\mathcal{P}$, that is, a class of probability measures over a
measurable space $(\Omega, \mathcal{A})$. As an example, one could think of the law of a sample
drawn according to two unknown smooth densities over a higher dimensional
space, intersecting each other transversally in a hypersurface. Consider
another measurable space $(\mathcal{H}, \mathcal{B}(\mathcal{H}))$ and a loss function

(7.1)  $L\colon \mathcal{P}\times\mathcal{H}\to\mathbb{R}, \qquad (P,h)\mapsto L_P(h).$

Assume that, for each $P\in\mathcal{P}$, there exists a minimizer $h_P\in\mathcal{H}$, that is,

(7.2)  $L_P(h_P)\le L_P(h) \quad\text{for all } h\in\mathcal{H}.$

Set $\Delta_P(h) := L_P(h) - L_P(h_P)$. For $\gamma > 0$, set $\Delta_P^\gamma(h) := \min\{\gamma, \Delta_P(h)\}$. The
following lemma generalizes Lemma 2.2.

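Lemma 7.1. Let $P, Q\in\mathcal{P}$ and $\gamma > 0$ such that $\Delta_P(h) + \Delta_Q(h)\ge\gamma$,
$\forall h\in\mathcal{H}$. Then for any $\delta\in(0,1/2)$ and for any estimator $\hat h_n\colon\Omega^n\to\mathcal{H}$, at
least one of the following two statements holds:

(7.3)  $E_{P^n}(\Delta_P^\gamma(\hat h_n)) \ge \delta\gamma$

or

(7.4)  $E_{Q^n}(\Delta_Q^\gamma(\hat h_n)) \ge \bigl(\tfrac{1}{2} - \delta\bigr)\gamma\exp(-2nH(P,Q) - 1).$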

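Proof. Note that $\Delta_P^\gamma$ and $\Delta_Q^\gamma$ take values in $[0,\gamma]$. It is easily seen that
$\Delta_P^\gamma(h) + \Delta_Q^\gamma(h)\ge\gamma$, for all $h\in\mathcal{H}$. Hence, $E_{P^n}(\Delta_P^\gamma(\hat h_n) + \Delta_Q^\gamma(\hat h_n))\ge\gamma$, and
for any $\delta\in(0,1/2)$, we have $E_{P^n}(\Delta_P^\gamma(\hat h_n))\ge\delta\gamma$ or $E_{P^n}(\Delta_Q^\gamma(\hat h_n))\ge(1-\delta)\gamma$.
By Lemma 2.1, we know that

(7.5)
$$\begin{aligned}
\frac{1}{\gamma}E_{Q^n}(\Delta_Q^\gamma(\hat h_n)) &\ge \Bigl(\frac{1}{\gamma}E_{P^n}(\Delta_Q^\gamma(\hat h_n)) - \frac{1}{2}\Bigr)\exp(-2nH(P,Q) - 1) \\
&= \frac{1}{\gamma}\Bigl(E_{P^n}(\Delta_Q^\gamma(\hat h_n)) - \frac{\gamma}{2}\Bigr)\exp(-2nH(P,Q) - 1).
\end{aligned}$$

So, if $E_{P^n}(\Delta_Q^\gamma(\hat h_n))\ge(1-\delta)\gamma$, then

(7.6)
$$\begin{aligned}
E_{Q^n}(\Delta_Q^\gamma(\hat h_n)) &\ge \Bigl((1-\delta)\gamma - \frac{\gamma}{2}\Bigr)\exp(-2nH(P,Q) - 1) \\
&= \bigl(\tfrac{1}{2} - \delta\bigr)\gamma\exp(-2nH(P,Q) - 1). \qquad\Box
\end{aligned}$$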

In a higher dimensional setup, beyond the scope of this paper, Lemma 7.1
can be used as a replacement for Lemma 2.2. The assumption
$\beta_n|a(P) - a(Q_n)|\ge 4$ then becomes $\inf_h(\Delta_P^{\gamma_n}(h) + \Delta_{Q_n}^{\gamma_n}(h))\ge\gamma_n$, where $\gamma_n$ depends on
the problem. The lower bounds are obtained for the probability of $\Delta_P^{\gamma_n}(\hat h_n)$
being large, where $\hat h_n$ is an estimator.